Visualizing NYC Energy and Gas statistics
By: David Zhang and Hamim Choudhury
A general Overview of the original city Datasets
<class 'pandas.core.frame.DataFrame'> RangeIndex: 103259 entries, 0 to 103258 Columns: 265 entries, Calendar Year to Neighborhood Tabulation Area (NTA) (2020) dtypes: bool(1), float64(8), int64(8), object(248) memory usage: 208.1+ MB None
This amount of data is unnecessary for the scope of our project. Also many specific data points are stored as NULL or "not available". So we will parse through the data and extract the most interesting and useful columns.
KEEP_COLS = [
"Property ID",
"Property Name",
"Address 1",
"Postal Code",
"Year Built",
"Primary Property Type - Portfolio Manager-Calculated",
"Latitude",
"Longitude",
"Borough",
"Census Tract (2020)",
"Neighborhood Tabulation Area (NTA) (2020)",
"Property GFA - Calculated (Buildings) (ft²)",
"Site Energy Use (kBtu)",
"Weather Normalized Site Energy Use (kBtu)",
"Electricity Use - Grid Purchase (kWh)",
"Natural Gas Use (therms)",
"Site EUI (kBtu/ft²)",
"Weather Normalized Site EUI (kBtu/ft²)",
"ENERGY STAR Score",
"Total (Location-Based) GHG Emissions (Metric Tons CO2e)",
"Total (Location-Based) GHG Emissions Intensity (kgCO2e/ft²)",
"Occupancy",
"Fuel Oil #2 Use (kBtu)",
"District Steam Use (kBtu)",
"Office - Weekly Operating Hours",
"Office - Number of Workers on Main Shift",
"Multifamily Housing - Total Number of Residential Living Units"
]
With our new parquet file we can start looking for any interesting patterns and findings
Before we start visualizing data, we have to understand what some of the data means.
GFA or gross floor area refers to the size of a building in square feet.
Site energy use and weather normalized site energy use refer to the total amount of energy spent.
Site EUI (energy use intensity) is a measure of how much energy a property uses compared to its size.
GHG (green house gases) emissions is how much gas is released converted to be represented in the equivalent mass of CO2.
Energy Star certification is a label given to buildings that operate more efficiently than 75% of other buildings.
def clean_numeric_column(series):
"""Convert string columns with commas to numeric"""
if series.dtype == 'object':
return pd.to_numeric(series.str.replace(',', ''), errors='coerce')
return series
numeric_cols = [
'Property GFA - Calculated (Buildings) (ft²)',
'Site Energy Use (kBtu)',
'Weather Normalized Site Energy Use (kBtu)',
'Electricity Use - Grid Purchase (kWh)',
'Natural Gas Use (therms)',
'Site EUI (kBtu/ft²)',
'Weather Normalized Site EUI (kBtu/ft²)',
'ENERGY STAR Score',
'Total (Location-Based) GHG Emissions (Metric Tons CO2e)',
'Total (Location-Based) GHG Emissions Intensity (kgCO2e/ft²)',
'Fuel Oil #2 Use (kBtu)',
'District Steam Use (kBtu)'
]
for col in numeric_cols:
if col in df.columns:
df[col] = clean_numeric_column(df[col])
# Create shorter column names for easier plotting
df['GFA'] = df['Property GFA - Calculated (Buildings) (ft²)']
df['Site_Energy'] = df['Site Energy Use (kBtu)']
df['Weather_Norm_Energy'] = df['Weather Normalized Site Energy Use (kBtu)']
df['Electricity'] = df['Electricity Use - Grid Purchase (kWh)']
df['Gas'] = df['Natural Gas Use (therms)']
df['Site_EUI'] = df['Site EUI (kBtu/ft²)']
df['Weather_Norm_EUI'] = df['Weather Normalized Site EUI (kBtu/ft²)']
df['ENERGY_STAR'] = df['ENERGY STAR Score']
df['GHG_Total'] = df['Total (Location-Based) GHG Emissions (Metric Tons CO2e)']
df['GHG_Intensity'] = df['Total (Location-Based) GHG Emissions Intensity (kgCO2e/ft²)']
df['Property_Type'] = df['Primary Property Type - Portfolio Manager-Calculated']
# Remove extreme outliers for better visualization
df_clean = df[
(df['GFA'] > 0) &
(df['Site_EUI'] > 0) &
(df['Site_EUI'] < df['Site_EUI'].quantile(0.99))
].copy()
print(f"Original data: {len(df)} rows")
print(f"Cleaned data: {len(df_clean)} rows")
Now that we have a general understanding of what the data represents, we can now start analyzing any trends and statistics.
We can first start with a general visualization of the distribution of our data.
# 1. Distribution of Properties by Borough (Bar Chart)
borough_counts = df_clean['Borough'].value_counts().reset_index()
borough_counts.columns = ['Borough', 'Count']
# Create Plotly bar chart
fig_borough = px.bar(
borough_counts,
x='Borough',
y='Count',
title='Distribution of Properties by Borough',
labels={'Count': 'Number of Properties', 'Borough': 'Borough'},
color='Borough',
template='plotly_white',
autoresize=True
)
# Order the bars by count (largest first)
fig_borough.update_layout(xaxis={'categoryorder': 'total descending'})
fig_borough.show()
Now we can answer some questions like ...
How does the age of a building correlate to its energy performance?
Now we can move onto analyzing how different attributes of buildings affect its emmisions and energy consumption.
Parameters like the building type, size, and occupancy can heavily affect the output.
Now lets take a deeper look into the statistics of each borough and find out which borough is superior...
Of course its Brooklyn
We can end this off with a general correlation matrix to see the overall relationship of the dataset.